1. Train_test_split
Underfitting/High Bias: The hypothesis function h maps poorly to the trend of the data.
Reasons:
1. The function is too simple
2. It uses too few features
Overfitting/High Variance: The hypothesis function fits the available training data well but does not generalize to new data. Two ways to address it:
1. Reduce the number of features:
a. Manually select which features to keep
b. Use model selection algorithm
2. Regularization
a. Keep all the features, but reduce the magnitude of the parameters θj (see the sketch below)
b. Regularization works well when we have a lot of slightly useful features
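As a quick illustration of point 2, here is a minimal sketch (not part of the case studies below) using scikit-learn's Ridge, an L2-regularized linear model; the synthetic data and the alpha value are made-up assumptions, chosen only to show how regularization shrinks coefficient magnitudes.
# Minimal sketch: L2 regularization shrinks coefficient magnitudes.
# The synthetic data and alpha value are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 10))            # 50 observations, 10 features
y = X[:, 0] * 3 + rng.normal(size=50)    # only the first feature is truly useful
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)      # larger alpha -> stronger shrinkage of the θj
print('OLS   sum of |coef| =', np.abs(ols.coef_).sum())
print('Ridge sum of |coef| =', np.abs(ridge.coef_).sum())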
What is Linear Regression?
Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables.
$Y = \beta_0 + \beta_1 X + \epsilon$
1. Y - the variable we predict
2. X - the variable we use to make the prediction
3. β0 - the intercept term; the predicted value of Y when X = 0
4. β1 - the slope term; the change in Y when X changes by 1 unit
5. ε - the residual term, i.e. the difference between the actual and predicted values
6. Error reduction techniques
a. Ordinary Least Squares (OLS) - minimizes ∑[Actual(y) - Predicted(y')]²
Why OLS?
i. It uses squared error, which has nice mathematical properties, making it easier to differentiate and to run gradient descent
ii. OLS is easy to analyze and computationally fast, i.e. it can be quickly applied to data sets with thousands of features
iii. The interpretation of OLS is much easier than that of other regression techniques
b. Generalized Least Square
c. Percentage Least Square
d. Total Least Square
e. Least absolute deviation
Formula for calculating the coefficients
$\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$ where n is the number of observations
$\beta_0 = \bar{y} - \beta_1 \bar{x}$
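A minimal NumPy sketch of these two formulas; the x and y arrays are made-up values, not from the housing data below.
# Minimal sketch of the closed-form OLS coefficients above (x and y are made-up values)
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
beta0 = y.mean() - beta1 * x.mean()                                            # intercept
print('beta1 =', beta1)
print('beta0 =', beta0)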
In [196]:
# Case Study : Predicting Housing Price
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [197]:
df = pd.read_csv("C:/Users/melvin/Machine Learning/Linear Regression/USA_Housing.csv")
In [198]:
# Summary
df.head()
Out[198]:
In [191]:
df.info()
In [192]:
# Validating Linear Regression Assumptions
df.describe()
Out[192]:
In [193]:
df.columns
Out[193]:
In [194]:
sns.pairplot(df)
Out[194]:
In [195]:
sns.distplot(df['Price'])
Out[195]:
In [89]:
sns.heatmap(df.corr(),annot=True)
Out[89]:
1. There exists a linear and additive relationship between the dependent variable (DV) and the independent variables (IVs)
2. No multicollinearity - absence of correlation between the independent variables
3. Homoscedasticity - the error terms have constant variance (absence of heteroscedasticity)
4. No autocorrelation - absence of correlation among the error terms
5. The dependent variable and the error terms should follow a normal distribution
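One common way to check assumption 2 is the variance inflation factor (VIF). The sketch below is not part of the original notebook; it assumes statsmodels is installed and uses the numeric feature columns of this dataset.
# Minimal sketch (assumes statsmodels is installed): VIF as a multicollinearity check.
# A VIF much larger than ~5-10 is a common warning sign.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
features = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
exog = sm.add_constant(features)   # VIF should be computed with an intercept column
for i, col in enumerate(exog.columns):
    if col != 'const':
        print(col, variance_inflation_factor(exog.values, i))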
In [181]:
sns.pairplot(df)
Out[181]:
In [17]:
df.columns
Out[17]:
In [100]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms','Avg. Area Number of Bedrooms', 'Area Population']]
y= df['Price']
In [113]:
# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))
In [28]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
In [33]:
lm.fit(X_train,y_train)
print('Intercept = ', lm.intercept_ ,'Coefficients = ',lm.coef_)
In [138]:
import seaborn as sns
anscombe = sns.load_dataset("anscombe")  # example dataset (not used below)
# Synthetic data to illustrate a residual plot: residuals should scatter randomly around zero
rs = np.random.RandomState(7)
x = rs.normal(2, 1, 75)
y = 2 + 1.5 * x + rs.normal(0, 2, 75)
sns.residplot(x=x, y=y, lowess=True)
Out[138]:
In [91]:
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
boston.feature_names
Out[91]:
In [152]:
'''
import statsmodels.api as sm
model = sm.OLS(y_train,X_train).fit()
predictions = model.predict(X_test)
# Print out the statistics
model.summary()
'''
Out[152]:
In [169]:
X = boston.data
y = boston.target
# Splitting the dataset into training & test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=2)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
print('R-Square =',(lm.score(X_train,y_train) * 100))
In [182]:
from sklearn import metrics
predictions = lm.predict(X_test)
accuracy = metrics.r2_score(y_test, predictions)
print('R-Square', accuracy * 100)
In [176]:
plt.scatter(y_test,predictions)
Out[176]:
In [118]:
import seaborn as sns
sns.distplot((y_test-predictions))
Out[118]:
In [187]:
from sklearn import metrics
print('MAE  =', metrics.mean_absolute_error(y_test, predictions))
print('MSE  =', metrics.mean_squared_error(y_test, predictions))
print('RMSE =', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('Explained Variance =', metrics.explained_variance_score(y_test, predictions))
Error metrics are the crucial evaluation numbers we must check. Since all of these measure error, the lower the number, the better the model. Let's look at them one by one:
MSE - This is the mean squared error. It tends to amplify the impact of outliers on the model's accuracy. For example, suppose the actual y is 10 and the predicted y is 30; the resulting squared error would be (30-10)² = 400.
MAE - This is the mean absolute error. It is robust against the effect of outliers. Using the previous example, the resulting absolute error would be |30-10| = 20.
RMSE - This is the root mean squared error. It is interpreted as how far, on average, the residuals are from zero. It undoes the squaring in MSE by taking the square root and returns the result in the same units as the data. Here, the resulting RMSE would be √(30-10)² = 20. Don't be baffled when you see the same value for MAE and RMSE; usually we calculate these numbers after summing over all the (actual - predicted) values in the data.
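A minimal sketch of the three metrics using the single 10-vs-30 example from the text; in practice they are averaged over all observations, as the sklearn calls above do.
# Minimal sketch: MAE, MSE and RMSE on the 10 vs. 30 example from the text
import numpy as np
actual = np.array([10.0])
predicted = np.array([30.0])
residuals = actual - predicted
mae = np.mean(np.abs(residuals))   # 20
mse = np.mean(residuals ** 2)      # 400
rmse = np.sqrt(mse)                # 20
print('MAE =', mae, 'MSE =', mse, 'RMSE =', rmse)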
Logistic Regression belongs to the family of generalized linear models. It is a binary classification algorithm used when the response variable is dichotomous (1 or 0).
Examples:
1. Ham/Spam
2. Loan Defaulters(Yes/No)
3. Disease Diagnosis
1. The response variable must follow a binomial distribution
2. Logistic Regression assumes a linear relationship between the independent variables and the link function (logit)
3. The dependent variable should have mutually exclusive and exhaustive categories
Note: A plain linear function can produce probabilities outside the [0,1] interval, making them invalid predictions; the logit link fixes this (see the sketch below)
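A minimal sketch of the logistic (sigmoid) link, which squashes any linear score into (0,1) and therefore yields valid probabilities; the coefficient and x values here are made up.
# Minimal sketch: the sigmoid maps any linear score into (0, 1).
# beta0, beta1 and the x values are made-up illustrations.
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
beta0, beta1 = -1.0, 0.8
x = np.array([-10.0, 0.0, 2.5, 10.0])
linear_score = beta0 + beta1 * x       # can be any real number
probability = sigmoid(linear_score)    # always strictly between 0 and 1
print(linear_score)
print(probability)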
1. Multinomial Logistic Regression
2. Ordinal Logistic Regression
Multinomial logistic regression handles the multi-class problem by fitting K-1 independent binary logistic classifier models.
Drawback:
a. It doesn't scale well in the presence of a large number of target classes
b. Requires a larger dataset to achieve reasonable accuracy
Ordinal logistic regression is used when the target variable is ordinal in nature (e.g. years of work experience: 5 > 4 > 3 > 2 > 1). It builds a single model with multiple threshold values.
If we have K classes, the model requires K-1 thresholds or cutoff points. It also makes the important assumption of proportional odds: on the logit (S-shaped) scale, all of the thresholds lie on a straight line.
Note: Logistic regression is not a great choice for multi-class problems, but it's good to be aware of these variants. In this tutorial we'll focus on logistic regression for the binary classification task.
a. A unit change in an input feature doesn't affect the predicted probability directly; it changes the log-odds, and hence the odds ratio (see the sketch below)
b. We use the maximum likelihood method to determine the best coefficients and eventually a good model fit (it tries to find values of β0 and β1 such that the resulting probabilities are closest to either 1 or 0)
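A minimal sketch of point a: a fitted coefficient maps to an odds ratio through exp(β1); the coefficient value here is made up, not taken from any model in this notebook.
# Minimal sketch: a one-unit increase in the feature multiplies the odds of y = 1 by exp(beta1)
import numpy as np
beta1 = 0.7                    # made-up fitted coefficient
odds_ratio = np.exp(beta1)
print('A one-unit increase multiplies the odds by', round(odds_ratio, 3))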
1. Akaike Information Criterion (AIC)
a. Counterpart of adjusted R-Square in multiple regression
b. The smaller, the better
c. Unlike R-Square, AIC does not automatically improve when more variables are added; its complexity penalty helps guard against overfitting
Note: Looking at the AIC of a single model in isolation doesn't tell you much. Build 2 or 3 logistic regression models and compare their AICs (see the sketch below)
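A minimal sketch of that comparison, assuming statsmodels is available; it fits two logistic models on made-up synthetic data (not the Titanic data used later) and compares their AICs.
# Minimal sketch (assumes statsmodels is installed): compare the AIC of two logistic models.
# The synthetic data is a made-up illustration; the model with the lower AIC is preferred.
import numpy as np
import statsmodels.api as sm
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
m1 = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=0)   # model with 1 feature
m2 = sm.Logit(y, sm.add_constant(X[:, :2])).fit(disp=0)   # model with 2 features
print('AIC with 1 feature :', m1.aic)
print('AIC with 2 features:', m2.aic)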
2. Null Deviance and Residual Deviance
a. The deviance of an observation is computed as -2 times its log likelihood
b. Null deviance comes from the null model, which predicts the class using only a constant (intercept-only) probability
c. Residual deviance is calculated from the model that includes all the features
d. Results (see the sketch below):
i. The larger the difference between the null and residual deviance, the better the model
ii. A low null deviance means the response can already be explained reasonably well by the intercept alone
iii. The lower the residual deviance, the better the model
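A minimal sketch of these quantities, again assuming statsmodels and made-up synthetic data: the null and residual deviances are -2 times the log-likelihood of the intercept-only model and the full model, respectively.
# Minimal sketch (assumes statsmodels is installed): null vs. residual deviance.
import numpy as np
import statsmodels.api as sm
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
null_deviance = -2 * result.llnull      # intercept-only model
residual_deviance = -2 * result.llf     # model with all the features
print('Null deviance     =', null_deviance)
print('Residual deviance =', residual_deviance)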
3. Confusion Matrix
              1 (Predicted)        0 (Predicted)
1 (Actual)    TP                   FN (Type 2 error)
0 (Actual)    FP (Type 1 error)    TN
Metrics:
Accuracy: It determines the overall predictive accuracy of the model.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive Rate / Sensitivity / Recall: It indicates how many positive values, out of all the actual positive values, have been correctly predicted.
Sensitivity/Recall = TP / (TP + FN)
False Negative Rate = 1 - Sensitivity
True Negative Rate / Specificity: It indicates how many negative values, out of all the actual negative values, have been correctly predicted.
Specificity = TN / (TN + FP)
False Positive Rate = 1 - Specificity
Precision: It indicates how many values, out of all the predicted positive values, are actually positive.
Precision = TP / (TP + FP)
F Score: The F score is the harmonic mean of precision and recall.
It lies between 0 and 1; the higher the value, the better the model.
It is formulated as 2 * ((precision * recall) / (precision + recall)) (see the sketch below)
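A minimal sketch of all of the formulas above, computed from made-up confusion-matrix counts.
# Minimal sketch: the metrics above from made-up confusion-matrix counts
TP, FN, FP, TN = 50, 10, 5, 100    # illustrative values only
accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)            # sensitivity / true positive rate
specificity = TN / (TN + FP)       # true negative rate
precision = TP / (TP + FP)
f_score = 2 * (precision * recall) / (precision + recall)
print(accuracy, recall, specificity, precision, f_score)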
4. Receiver Operating Characteristic (ROC)
The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied. The model's overall discriminative ability is summarized by the Area Under the Curve (AUC).
Measure: the higher the area, the better the model (see the sketch below)
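A minimal sketch using scikit-learn's roc_curve and roc_auc_score on made-up labels and scores (not the Titanic model below), showing how the curve points and the AUC are obtained.
# Minimal sketch: ROC curve points and AUC with scikit-learn, on made-up labels and scores
from sklearn.metrics import roc_curve, roc_auc_score
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]   # predicted probabilities of class 1
fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one (FPR, TPR) point per threshold
print('AUC =', roc_auc_score(y_true, y_score))        # closer to 1 is better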
In [347]:
# Case Study Titanic Dataset:
# Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [430]:
# Importing the required training dataset
train= pd.read_csv("C:/Users/melvin/Machine Learning/Logistic Regression/train.csv")
In [431]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[431]:
Findings:
1. We are missing a lot of Cabin information
2. A lot of Age information is also missing
3. One Embarked value is missing
Solution:
1. For Age we can impute the missing values
2. For Cabin we can drop the column or transform it into a categorical variable like Known/Unknown
In [432]:
sns.set_style('whitegrid')
In [433]:
sns.countplot(x='Survived',hue='Sex',data=train)
Out[433]:
In [434]:
sns.countplot(x='Survived',hue='Pclass',data=train)
Out[434]:
In [435]:
sns.distplot(train['Age'].dropna(),bins=30,kde = False)
Out[435]:
In [436]:
train.info()
In [437]:
sns.countplot(x='SibSp',data=train,hue='Survived')
Out[437]:
In [438]:
Finding:
1. We can see that most of the people who had no siblings or one sibling aboard died. It's the opposite of what I thought.
In [439]:
'''
import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=50)
'''
Out[439]:
In [440]:
plt.figure(figsize=(10,7))
sns.boxplot(y='Age',x='Pclass',data=train)
Out[440]:
In [441]:
# Fill missing ages with a typical age for each passenger class (based on the boxplot above)
def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1)
In [442]:
plt.figure(figsize=(18,8))
sns.heatmap(train.isnull(),yticklabels=False,cmap='viridis')
Out[442]:
In [443]:
train.drop('Cabin',axis=1,inplace=True)
In [444]:
train.dropna(inplace=True)
In [445]:
# Converting the categorical variables into dummy variables
sex = pd.get_dummies(train['Sex'],drop_first=True)
sex.count()
Out[445]:
In [446]:
embark = pd.get_dummies(train['Embarked'],drop_first=True)
embark.count()
Out[446]:
In [447]:
train = pd.concat([train,sex,embark],axis=1)
train
Out[447]:
In [448]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
In [449]:
train.head()
Out[449]:
In [450]:
train.drop(['PassengerId'],axis=1,inplace=True)
In [451]:
train
Out[451]:
In [452]:
train.count()
Out[452]:
In [453]:
X = train.drop('Survived',axis=1)
y = train['Survived']
In [456]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.20,random_state = 1)
In [464]:
from sklearn.linear_model import LogisticRegression
Logistic_M = LogisticRegression(n_jobs=5)
In [465]:
Logistic_M.fit(X_train,y_train)
Out[465]:
In [466]:
predictions = Logistic_M.predict(X_test)
In [467]:
from sklearn.metrics import classification_report
In [468]:
print(classification_report(y_test,predictions))
In [469]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
Out[469]:
KNN is a classification algorithm.
How does it work? (see the sketch after the pros and cons below)
1. Compute the distance from the new point to every point in the training data
2. Select the K nearest points
3. Predict the majority class among those K neighbors
Pro's:
1. Very Simple
2. Training is trivial
3. Works with any number of classes
4. Easy to add more data
5. Few parameters:
a. K
b. Distance Metric
Con's :
1. High Prediction Cost
2. Not good with high dimensional data
3. Categorical features don't work well
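A minimal NumPy sketch of the distance-and-vote idea described under "How does it work?" above; the training points, labels, query point and K are made-up illustrations.
# Minimal sketch of the KNN idea: distance to every training point, then a majority vote.
# The training points, labels, query point and K are made-up illustrations.
import numpy as np
from collections import Counter
X_train_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9], [0.9, 1.1]])
y_train_toy = np.array([0, 0, 1, 1, 0])
query = np.array([1.1, 1.0])
k = 3
distances = np.linalg.norm(X_train_toy - query, axis=1)     # Euclidean distance metric
nearest = np.argsort(distances)[:k]                         # indices of the K closest points
vote = Counter(y_train_toy[nearest]).most_common(1)[0][0]   # majority class among them
print('Predicted class:', vote)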
KNN Use Case :
In [3]:
# Importing the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [5]:
# Importing the required dataset
data = pd.read_csv("C:/Users/melvin/Machine Learning/KNN/KNN_Project_Data.csv")
data.head()
Out[5]:
In [6]:
sns.pairplot(data=data)
Out[6]:
In [7]:
# Standardize the Variables
In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=data.columns[:-1])
df_feat.head()
Out[8]:
In [10]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled_features,data['TARGET CLASS'],test_size=0.30)
from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier()
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)
In [11]:
# Metrics
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))
In [12]:
# Evaluate KNN for a range of K values and record the test error rate for each
error_rate = []
# Will take some time
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
In [13]:
# Plot the error rate against K to choose a good value
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[13]:
In [18]:
from sklearn.neighbors import KNeighborsClassifier
Knn = KNeighborsClassifier(n_neighbors=20)
Knn.fit(X_train,y_train)
prediction = Knn.predict(X_test)
In [19]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))
Done!
In [ ]:
# Random Forest